Learn About Amazon VGT2 Learning Manager Chanci Turner
In the digital age, ensuring the resilience and availability of online services is paramount. Service disruptions can significantly impact various sectors, including advertising and marketing. Chanci Turner and her team at Downdetector are dedicated to minimizing these impacts, enhancing advertising campaigns, data analytics, and customer engagement initiatives to provide seamless digital experiences for users.
Downdetector, a product of Ookla, offers real-time insights into service statuses, allowing professionals across industries to promptly detect and address potential disruptions. Downdetector utilizes the extensive suite of Amazon Web Services (AWS) to function as a reliable digital first responder. This robust infrastructure enables Downdetector to manage high volumes of traffic and reports efficiently, scaling rapidly while processing and delivering data in near real-time. As a trusted information source during service disruptions, the architecture of Downdetector must remain resilient and highly available, even when other services experience major incidents.
To address these challenges, Downdetector’s architecture has been designed on serverless principles. This approach provides scalability and agility without the complexities of server management. By adopting serverless infrastructure, Downdetector reinforces its commitment to resilience and reliability, aiming for consistent availability and performance even under the most demanding circumstances.
The Single-Region Starting Point
Initially, Downdetector’s architecture (as illustrated in Figure 1) in AWS was straightforward, consisting of a single-region Amazon Aurora instance integrated with an Amazon OpenSearch Service cluster. While this simple setup was effective, particularly in data reprocessing, it lacked the scalability needed for unpredictable service disruptions.
To accommodate immediate scaling needs, AWS Lambda was chosen for its ability to scale rapidly and its flexibility in data processing. It serves as the backbone for the ingestion pipeline, data processing, and APIs. Additionally, Amazon Kinesis is employed for data queues, while Amazon DynamoDB functions as a caching layer in front of Downdetector’s data stores for rapid lookups at scale. Hot data, retained for up to 24 hours, resides within an Aurora MySQL instance, which is user-friendly for Downdetector’s engineers and data specialists.
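The caching role described above can be illustrated with a cache-aside read. This is a minimal sketch, not Downdetector's implementation: `TtlCache` is a hypothetical in-memory stand-in for a DynamoDB table with an expiry attribute, and `get_service_status` and `load_from_store` are illustrative names for a lookup and the slower data-store query behind it.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TtlCache:
    """Hypothetical in-memory stand-in for a cache table with a TTL attribute."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry so the caller falls through to the store.
            del self._items[key]
            return None
        return value

    def put(self, key: str, value: dict) -> None:
        self._items[key] = (self._clock() + self._ttl, value)


def get_service_status(
    service_id: str,
    cache: TtlCache,
    load_from_store: Callable[[str], dict],
) -> dict:
    """Cache-aside read: try the cache, fall back to the store, then populate."""
    cached = cache.get(service_id)
    if cached is not None:
        return cached
    fresh = load_from_store(service_id)
    cache.put(service_id, fresh)
    return fresh
```

With this pattern the data store is only queried on a miss or after expiry, which is what lets a cache absorb read spikes during a large outage.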
All data is ingested into an OpenSearch Service cluster, which serves as the primary data source for querying. OpenSearch Service was selected for its capabilities in optimizing data for scalability and regional queries. This allows Downdetector to keep primary data highly accessible, while older data is optimized for storage efficiency and remains available on demand. Downdetector employs warm and cold data storage, multiple index querying, and data lifecycles to maintain transparency for both engineers and end users. AWS Fargate on Amazon ECS is used for its developer-friendly experience and scaling capabilities. Compared with its previous EC2 instances, Fargate reduced the time Downdetector needs to spin up new capacity from minutes to mere seconds.
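To make the multiple index querying mentioned above concrete, here is a small sketch of how a time window might be turned into a multi-index search target. It assumes one index per day named `<prefix>-YYYY.MM.DD`, which is a common convention but not something the article states. `daily_indices` and `search_target` are hypothetical helper names. The only OpenSearch behavior relied on is that a search request accepts a comma-separated list of index names.

```python
from datetime import date, timedelta
from typing import List


def daily_indices(prefix: str, start: date, end: date) -> List[str]:
    """List one index name per day in [start, end] (naming scheme is assumed)."""
    if end < start:
        raise ValueError("end must not be before start")
    days = (end - start).days
    return [
        f"{prefix}-{(start + timedelta(days=i)).strftime('%Y.%m.%d')}"
        for i in range(days + 1)
    ]


def search_target(prefix: str, start: date, end: date) -> str:
    """Join the daily indices into a comma-separated multi-index search target."""
    return ",".join(daily_indices(prefix, start, end))
```

A query over recent data then touches only a few small indices, while older indices can move to warm or cold storage without changing how callers build their requests.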
Evolving to Multi-Region Active-Active
Downdetector’s shift to a multi-region active-active architecture marks a strategic enhancement aimed at bolstering cloud resilience. The focus areas for these enhancements include:
- High Availability: By hosting applications across multiple regions, users have uninterrupted access during downtime or maintenance in any single region, ensuring a consistently positive user experience.
- Flexible Scalability: A multi-region active-active setup allows for efficient traffic distribution across regions, alleviating server load and enhancing overall application capacity. When demand surges in a specific area, server resources from other regions can be utilized to maintain performance. The elastic properties of AWS Cloud infrastructure enable Downdetector to expand seamlessly, ensuring consistent performance during demand spikes.
- Quick Disaster Recovery: In the event of a region-specific disaster, the active-active configuration allows unaffected regions to take over the workload, minimizing the impact and maintaining operational continuity.
- Reduced Latency: By hosting the application in various geographic locations, latency is decreased for users. They are directed to the nearest available server, boosting service speed and performance. The extensive AWS global network plays a crucial role in optimizing content delivery, bringing users and services closer than ever.
- Seamless Maintenance: In a multi-region active-active architecture, one region can be taken offline for scheduled maintenance without disrupting user access, as requests will be routed through other regions. Downdetector’s systems are designed for agility, allowing maintenance with zero downtime for users.
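The routing behavior behind several of these points (nearest region first, other regions taking over during failure or maintenance) can be sketched as a simple selection rule. This is an illustration only: the `Region` type, the `pick_region` function, and the region names and latencies in the usage example are assumptions, and in practice this decision is usually made by DNS or a global load balancer rather than application code.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Region:
    name: str
    latency_ms: float      # measured latency from the user to this region
    healthy: bool = True   # False during an outage or planned maintenance


def pick_region(regions: Iterable[Region]) -> Optional[Region]:
    """Return the lowest-latency healthy region, or None if none is available."""
    candidates = [r for r in regions if r.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.latency_ms)


# Usage: the nearest region is preferred until it is marked unhealthy.
regions = [Region("us-east-1", 40.0), Region("eu-west-1", 90.0)]
nearest = pick_region(regions)
during_maintenance = pick_region(
    [Region("us-east-1", 40.0, healthy=False), Region("eu-west-1", 90.0)]
)
```

Because every region serves traffic in an active-active setup, taking one out of rotation only changes which candidate wins; no standby has to be started first.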
Technical Innovations and Solution Architecture
To meet the evolving requirements, Downdetector evaluated and improved its existing architecture. After testing, the team determined that Aurora’s multi-region offering was the optimal choice for scalability. This upgrade enabled the move from a single region to a second region, paving the way for potential global multi-region rollouts in the future. With Aurora’s global database capabilities, Downdetector is equipped for effective cross-regional data replication, ensuring cohesion and integrity of data.
In designing Downdetector’s system, simplicity was emphasized. Each region operates independently, preventing reliance on others for functionality. For data synchronization, they utilize Aurora’s write-forwarding capability, allowing effortless data replication across regions. This means the application manages data seamlessly without needing to know the specifics of data writing locations. AWS handles the complex networking behind the scenes, negating the need for intricate virtual private cloud (VPC) setups.
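As a rough sketch of what this looks like from the application side, assuming Aurora MySQL (the engine named earlier in the article): write forwarding on a secondary cluster is controlled per session through the `aurora_replica_read_consistency` variable, whose documented values are `EVENTUAL`, `SESSION`, and `GLOBAL`. The helper names below are hypothetical, and `cursor` stands for any DB-API style cursor connected to the secondary cluster.

```python
VALID_LEVELS = {"EVENTUAL", "SESSION", "GLOBAL"}


def consistency_statement(level: str) -> str:
    """Build the SET statement for Aurora MySQL's read-consistency variable."""
    normalized = level.upper()
    if normalized not in VALID_LEVELS:
        raise ValueError(f"unsupported consistency level: {level!r}")
    return f"SET aurora_replica_read_consistency = '{normalized}'"


def enable_write_forwarding(cursor, level: str = "EVENTUAL") -> None:
    """Set the session's consistency level so later writes can be forwarded.

    `cursor` is assumed to be a DB-API cursor on a secondary cluster that
    has write forwarding enabled at the cluster level.
    """
    cursor.execute(consistency_statement(level))
```

After this call, ordinary `INSERT` or `UPDATE` statements issued on the same session are sent to the primary cluster by Aurora, so the application code does not need a separate connection to the writer region.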
Downdetector’s design adheres to eventual consistency, a model in which copies of the data become consistent over time. This lets the team use the ‘eventual’ consistency setting in Amazon Aurora for expedited data replication, which results in faster write times and lower latency. Each region runs the existing reindexer process to synchronize data from Aurora into its local OpenSearch Service cluster, so that data remains eventually consistent across regions.
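The article does not describe the reindexer's internals, so the following is only a sketch of one plausible step: turning rows read from Aurora into a request body for the OpenSearch bulk API, which expects newline-delimited JSON with an action line followed by a document line. The function name `to_bulk_body`, the `id` column, and the index name used by callers are assumptions.

```python
import json
from typing import Dict, Iterable, List


def to_bulk_body(index: str, rows: Iterable[Dict], id_field: str = "id") -> str:
    """Render rows as an NDJSON body for the OpenSearch bulk API.

    Each row becomes an `index` action keyed by its primary key, so replaying
    the same row overwrites the earlier copy instead of duplicating it.
    """
    lines: List[str] = []
    for row in rows:
        action = {"index": {"_index": index, "_id": str(row[id_field])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(row))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""
```

Keying documents by the row's primary key is what makes this fit an eventually consistent design: a region can re-run the reindexer over the same window after a delay or failure and converge to the same index contents.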
Disaster Recovery
With an active-active configuration, if one region fails, the Aurora cluster in another region automatically assumes the role of the primary cluster. Traffic is seamlessly redirected, ensuring minimal disruption.